Ordination and Classification of Vegetation Types

Predicting Deciduous Broadleaf Forests using Climate Data

Author

Prepared by Alejandro Ordonez

Published

7 October 2025

1 Introduction

In this 90-minute practical, you will gain hands-on experience with multivariate ordination and classification techniques commonly used in ecology and biogeography. These methods help us understand how environmental variables structure biological communities and predict vegetation distributions.

Learning Objectives:

  • Apply Principal Component Analysis (PCA) to reduce dimensionality of environmental data
  • Perform Principal Coordinates Analysis (PCoA) and Non-metric Multidimensional Scaling (NMDS)
  • Use K-means clustering to classify vegetation types
  • Interpret ordination plots and assess classification accuracy
  • Predict the distribution of deciduous broadleaf forests based on climate variables

Dataset Overview:

You have three datasets:

  1. BiomeData.csv - Site locations (x, y coordinates) and biome classifications (500 sites per vegetation type)
  2. ClimateData.csv - 19 bioclimatic variables plus elevation for each site
  3. WorldClimateData.csv - Global climate data for prediction (not used in this practical)

The bioclimatic variables (bio.1 to bio.19) represent annual trends, seasonality, and extreme environmental factors derived from temperature and precipitation data.

WorldClim Bioclimatic Variables

Variable | Name | Description | Units
bio.1 | Annual Mean Temperature | Mean of all weekly mean temperatures | °C
bio.2 | Mean Diurnal Range | Mean of monthly (max temp - min temp) | °C
bio.3 | Isothermality | (bio.2/bio.7) × 100 - day-to-night temperature oscillations relative to annual oscillations | %
bio.4 | Temperature Seasonality | Standard deviation of weekly mean temperatures | °C × 100
bio.5 | Max Temperature of Warmest Month | Highest temperature of any monthly maximum temperature | °C
bio.6 | Min Temperature of Coldest Month | Lowest temperature of any monthly minimum temperature | °C
bio.7 | Temperature Annual Range | bio.5 - bio.6 | °C
bio.8 | Mean Temperature of Wettest Quarter | Mean temperature during the wettest quarter | °C
bio.9 | Mean Temperature of Driest Quarter | Mean temperature during the driest quarter | °C
bio.10 | Mean Temperature of Warmest Quarter | Mean temperature during the warmest quarter | °C
bio.11 | Mean Temperature of Coldest Quarter | Mean temperature during the coldest quarter | °C
bio.12 | Annual Precipitation | Sum of monthly precipitation values | mm
bio.13 | Precipitation of Wettest Month | Precipitation of the wettest month | mm
bio.14 | Precipitation of Driest Month | Precipitation of the driest month | mm
bio.15 | Precipitation Seasonality | Coefficient of variation of monthly precipitation | %
bio.16 | Precipitation of Wettest Quarter | Total precipitation during the wettest quarter | mm
bio.17 | Precipitation of Driest Quarter | Total precipitation during the driest quarter | mm
bio.18 | Precipitation of Warmest Quarter | Total precipitation during the warmest quarter | mm
bio.19 | Precipitation of Coldest Quarter | Total precipitation during the coldest quarter | mm

Note: Temperature variables are in degrees Celsius (°C), precipitation variables are in millimeters (mm). Quarter refers to a period of 3 consecutive months.


3 Task 1: Principal Component Analysis (PCA)

3.1 Background and Objectives

What is PCA?

Principal Component Analysis is an ordination technique that transforms correlated environmental variables into a smaller set of uncorrelated variables called principal components (PCs). Each PC is a linear combination of the original variables and captures a decreasing amount of variance in the dataset.

Why use PCA?

  • Reduces dimensionality while retaining most variation in the data
  • Identifies which environmental variables drive differences between sites
  • Visualizes multidimensional data in 2D or 3D space
  • Works best with linear relationships between variables
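The "linear combination" idea can be verified directly. This toy example (simulated data, not the practical's files) shows that prcomp()'s scores are exactly the scaled data matrix multiplied by the loading (rotation) matrix:

```r
# Toy data only: three variables, 100 observations
set.seed(1)
x1  <- rnorm(100)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(100), x3 = rnorm(100))

toy_pca <- prcomp(toy, scale. = TRUE)

# Each PC score is a weighted sum (linear combination) of the scaled variables
manual_scores <- as.matrix(scale(toy)) %*% toy_pca$rotation
all.equal(manual_scores, toy_pca$x, check.attributes = FALSE)  # TRUE
```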

What we will do:

  1. Load and prepare the climate and biome data
  2. Standardize environmental variables (important because variables have different units)
  3. Perform PCA on the environmental data
  4. Determine how many PC axes to retain using the “elbow method” and “broken stick” model
  5. Interpret variable loadings to understand what each axis represents
  6. Visualize the ordination space colored by biome type
  7. Add convex hulls to show the environmental space occupied by each biome

3.2 Step 1.1: Load and Prepare Data

What this step does: In this step, we import two CSV files containing biome classifications and climate data, merge them by their common ID column so that each site has both its biome label and all environmental variables, and extract just the environmental variables (the 19 bioclimatic variables plus elevation) into a separate data frame for analysis. We also check for any missing values that could cause problems in our analysis.

Why we do this: Before running any analysis, we need to combine our ecological classification data (biomes) with the environmental predictors (climate variables). By merging the datasets, we create a complete picture where each site has both its identity (what biome it belongs to) and its environmental conditions. Checking for missing values is essential because most statistical methods cannot handle missing data and would either fail or produce unreliable results.

# Load the data
biome_data <- read.csv("BiomeData.csv", row.names = 1)
climate_data <- read.csv("ClimateData.csv", row.names = 1)

# Check structure
head(biome_data)
head(climate_data[, 1:6])
# Merge datasets by ID
full_data <- merge(biome_data, climate_data, by = "ID")

# Extract environmental variables (bio.1 to Elev)
env_vars <- full_data[, c("bio.1", "bio.2", "bio.3", "bio.4", "bio.5", 
                          "bio.6", "bio.7", "bio.8", "bio.9", "bio.10",
                          "bio.11", "bio.12", "bio.13", "bio.14", "bio.15",
                          "bio.16", "bio.17", "bio.18", "bio.19", "Elev")]
# Check for missing values
sum(is.na(env_vars))
[1] 40
# Remove rows with missing values
env_vars <- env_vars[complete.cases(env_vars), ]

3.3 Step 1.2: Perform PCA

What this step does: Here we perform the actual Principal Component Analysis using R’s prcomp() function with scale. = TRUE, which standardizes each variable to have mean = 0 and standard deviation = 1. We then calculate how much variance each principal component explains and display this information in a summary table showing both individual and cumulative variance explained.

Why we do this: Scaling is critical because our bioclimatic variables have very different units and scales (e.g., temperature in °C ranging from -50 to 50 vs precipitation in mm ranging from 0 to 10,000). Without scaling, variables with larger numeric ranges would dominate the PCA simply due to their scale, not their ecological importance. Standardization ensures all variables contribute equally to the analysis. Understanding variance explained helps us know how much information each PC captures and guides decisions about how many components to retain.
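The effect described above is easy to demonstrate with a hypothetical two-variable example (the names temp and precip are invented for illustration):

```r
# Toy data: one variable on a degrees-Celsius scale, one on a millimetre scale
set.seed(42)
toy <- data.frame(temp   = rnorm(200, mean = 15,   sd = 5),
                  precip = rnorm(200, mean = 1000, sd = 400))

# Without scaling, PC1 is essentially just the large-variance variable
unscaled <- prcomp(toy, scale. = FALSE)
round(unscaled$rotation[, "PC1"], 3)

# With scaling, both variables carry equal weight on PC1
scaled <- prcomp(toy, scale. = TRUE)
round(scaled$rotation[, "PC1"], 3)
```

Here precip's unscaled loading is effectively ±1 while temp's is near 0; after scaling, the two loadings are ±0.707 each.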

# Standardize the data (scale = TRUE centers and scales variables)
pca_result <- prcomp(env_vars, scale. = TRUE)

# Summary of PCA
summary(pca_result)
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     3.1979 1.8836 1.4892 1.1628 1.01421 0.80392 0.59986
Proportion of Variance 0.5113 0.1774 0.1109 0.0676 0.05143 0.03231 0.01799
Cumulative Proportion  0.5113 0.6887 0.7996 0.8672 0.91865 0.95097 0.96896
                           PC8     PC9    PC10    PC11    PC12    PC13    PC14
Standard deviation     0.46619 0.42274 0.34035 0.24722 0.13515 0.12363 0.07811
Proportion of Variance 0.01087 0.00894 0.00579 0.00306 0.00091 0.00076 0.00031
Cumulative Proportion  0.97983 0.98876 0.99455 0.99761 0.99852 0.99929 0.99959
                          PC15    PC16    PC17    PC18     PC19      PC20
Standard deviation     0.05573 0.05508 0.04133 0.01569 0.008455 3.057e-08
Proportion of Variance 0.00016 0.00015 0.00009 0.00001 0.000000 0.000e+00
Cumulative Proportion  0.99975 0.99990 0.99998 1.00000 1.000000 1.000e+00
# Variance explained by each PC - BY HAND
variance_explained <- (pca_result$sdev^2) / sum(pca_result$sdev^2) * 100
cumulative_variance <- cumsum(variance_explained)

# Display variance table - By hand
variance_table <- data.frame(
  PC = paste0("PC", 1:length(variance_explained)),
  Variance = round(variance_explained, 2),
  Cumulative = round(cumulative_variance, 2)
)
print(variance_table[1:10, ])
     PC Variance Cumulative
1   PC1    51.13      51.13
2   PC2    17.74      68.87
3   PC3    11.09      79.96
4   PC4     6.76      86.72
5   PC5     5.14      91.87
6   PC6     3.23      95.10
7   PC7     1.80      96.90
8   PC8     1.09      97.98
9   PC9     0.89      98.88
10 PC10     0.58      99.46

3.4 Step 1.3: Determine Number of Axes to Retain

What this step does: This step helps us decide how many principal components to keep for further analysis using two complementary methods:

  1. Scree Plot (Elbow Method): We plot the variance explained by each PC and look for where the curve “flattens out” - this is the elbow. Components before the elbow capture meaningful patterns; those after mostly capture noise.

  2. Broken Stick Model: This creates a null expectation by asking “if variance were distributed randomly among components, how much would each get?” We only keep components that explain MORE variance than this random expectation.

Why we do this: Not all principal components are equally informative. The first few PCs capture major patterns in the data, while later PCs often represent noise or very minor variations. Retaining too many components defeats the purpose of dimensionality reduction and can introduce noise into downstream analyses. Retaining too few loses important information. These two methods provide objective, complementary criteria: the elbow method is visual and intuitive, while the broken stick model provides a statistical threshold. Together, they help us find the optimal balance between simplification and information retention.

Elbow Method: Look for the “elbow” in the scree plot where variance explained drops off

Broken Stick Model: Compare observed variance to a null model where variance is randomly distributed
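Formally, the broken stick model gives the expected proportion of variance for component k out of n components as:

```latex
b_k = \frac{1}{n} \sum_{i=k}^{n} \frac{1}{i}, \qquad k = 1, \dots, n
```

The broken_stick() function below implements exactly this sum, multiplied by 100 to express it as a percentage.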

# Scree plot (Elbow method)
plot(1:20, variance_explained[1:20], 
     type = "b", pch = 19, col = "blue",
     xlab = "Principal Component", 
     ylab = "Variance Explained (%)",
     main = "Scree Plot - Elbow Method")
grid()

# Broken stick model - By hand
broken_stick <- function(n) {
  result <- numeric(n)
  for (i in 1:n) {
    result[i] <- (1/n) * sum(1/(i:n))
  }
  return(result * 100)
}
bs_values <- broken_stick(20)

# Plot comparison
plot(1:20, variance_explained[1:20], 
     type = "b", pch = 19, col = "blue",
     xlab = "Principal Component", 
     ylab = "Variance Explained (%)",
     main = "PCA: Observed vs Broken Stick Model",
     ylim = c(0, max(variance_explained[1:20])))
lines(1:20, bs_values, type = "b", pch = 17, col = "red")
legend("topright", legend = c("Observed", "Broken Stick"),
       col = c("blue", "red"), pch = c(19, 17), lty = 1)
grid()

# Number of axes above broken stick
n_axes_retain <- sum(variance_explained > bs_values)
cat("\nNumber of PCs to retain (Broken Stick):", n_axes_retain)

Number of PCs to retain (Broken Stick): 3

3.5 Step 1.4: Interpret Variable Loadings

What this step does: Variable loadings tell us which environmental variables contribute most to each principal component. A loading gives the sign and weight of an original variable in a PC; for standardized data, variables with large absolute loadings are also strongly correlated with that axis. In this step, we extract the loadings for the first 3 PCs, print the complete loading matrix, and identify the top 5 contributing variables for each PC by sorting their absolute values.

Why we do this: Raw principal components are abstract mathematical constructs - they’re just weighted combinations of our original variables. Interpretation transforms these abstractions into ecologically meaningful environmental gradients. For example, if PC1 has high loadings from temperature variables, we can interpret it as a “temperature gradient.” If PC2 loads heavily on precipitation seasonality variables, it represents “precipitation variability.” This interpretation is crucial for understanding what ecological or climatic factors actually differentiate our sites and biomes. High positive or negative loadings (far from zero) indicate strong contributions to that axis.

# Extract loadings for first 3 PCs
loadings_pc <- pca_result$rotation[, 1:3]
print(round(loadings_pc, 3))
          PC1    PC2    PC3
bio.1  -0.286 -0.203 -0.078
bio.2  -0.050 -0.360  0.085
bio.3  -0.260  0.003  0.247
bio.4   0.248 -0.023 -0.310
bio.5  -0.238 -0.286 -0.214
bio.6  -0.298 -0.122  0.020
bio.7   0.229 -0.124 -0.280
bio.8  -0.244 -0.215 -0.213
bio.9  -0.271 -0.160  0.035
bio.10 -0.250 -0.255 -0.216
bio.11 -0.296 -0.150  0.032
bio.12 -0.249  0.286  0.038
bio.13 -0.235  0.159  0.199
bio.14 -0.147  0.361 -0.247
bio.15  0.040 -0.161  0.501
bio.16 -0.240  0.172  0.181
bio.17 -0.155  0.366 -0.236
bio.18 -0.195  0.211  0.011
bio.19 -0.166  0.289 -0.045
Elev    0.137  0.043  0.408
# Identify top five contributing variables for each PC
for (i in 1:3) {
  cat("\n--- PC", i, "---\n")
  sorted_loadings <- sort(abs(loadings_pc[, i]), decreasing = TRUE)
  top_vars <- names(sorted_loadings)[1:5]
  cat("Top 5 variables:", paste(top_vars, collapse = ", "), "\n")
  print(round(loadings_pc[top_vars, i], 3))
}

--- PC 1 ---
Top 5 variables: bio.6, bio.11, bio.1, bio.9, bio.3 
 bio.6 bio.11  bio.1  bio.9  bio.3 
-0.298 -0.296 -0.286 -0.271 -0.260 

--- PC 2 ---
Top 5 variables: bio.17, bio.14, bio.2, bio.19, bio.12 
bio.17 bio.14  bio.2 bio.19 bio.12 
 0.366  0.361 -0.360  0.289  0.286 

--- PC 3 ---
Top 5 variables: bio.15, Elev, bio.4, bio.7, bio.14 
bio.15   Elev  bio.4  bio.7 bio.14 
 0.501  0.408 -0.310 -0.280 -0.247 

3.6 Step 1.5: Visualize PCA Ordination

What this step does: We visualize the PCA results by plotting the sites in the reduced ordination space. Each point represents one of the 1,598 sites retained after removing missing values, and we color points by their biome type. The code creates three 2D plots showing different combinations of principal components: PC1 vs PC2, PC1 vs PC3, and PC2 vs PC3.

Why we do this: Visualization is essential for understanding the relationships among sites and biomes in environmental space. Points that are close together have similar environmental conditions, while distant points differ substantially. By coloring points by biome, we can assess whether sites from the same biome cluster together (suggesting distinct environmental niches) or mix with other biomes (suggesting overlapping environmental tolerances). Different PC combinations reveal different aspects of environmental variation - PC1 vs PC2 usually shows the dominant gradients, while other combinations can reveal secondary patterns. The axis labels showing variance explained help you understand how much information each view contains.

# Extract PC scores
pc_scores <- pca_result$x

# Biome labels aligned with the rows of env_vars (full_data still contains
# the rows dropped during NA removal, so index it by env_vars' row names)
site_biomes <- full_data$BIOME.Name[as.numeric(row.names(env_vars))]

# Create color palette for biomes
biome_types <- unique(full_data$BIOME.Name)
n_biomes <- length(biome_types)
colors <- rainbow(n_biomes)
color_map <- setNames(colors, biome_types)

# 2D plot: PC1 vs PC2
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

plot(pc_scores[, 1], pc_scores[, 2],
     col = color_map[site_biomes],
     pch = 16, cex = 0.6,
     xlab = paste0("PC1 (", round(variance_explained[1], 1), "%)"),
     ylab = paste0("PC2 (", round(variance_explained[2], 1), "%)"),
     main = "PCA: PC1 vs PC2")

# 2D plot: PC1 vs PC3
plot(pc_scores[, 1], pc_scores[, 3],
     col = color_map[site_biomes],
     pch = 16, cex = 0.6,
     xlab = paste0("PC1 (", round(variance_explained[1], 1), "%)"),
     ylab = paste0("PC3 (", round(variance_explained[3], 1), "%)"),
     main = "PCA: PC1 vs PC3")

# 2D plot: PC2 vs PC3
plot(pc_scores[, 2], pc_scores[, 3],
     col = color_map[site_biomes],
     pch = 16, cex = 0.6,
     xlab = paste0("PC2 (", round(variance_explained[2], 1), "%)"),
     ylab = paste0("PC3 (", round(variance_explained[3], 1), "%)"),
     main = "PCA: PC2 vs PC3")
# Legend
plot.new()
legend("topright", legend = biome_types, col = colors, pch = 16, cex = 1)


3.7 Step 1.6: Add Convex Hulls

What this step does: Convex hulls are polygons that enclose all points belonging to each biome, like drawing a rubber band around each group. The chull() function finds the outermost points for each biome group, and polygon() draws lines connecting them. We draw hulls without fill (using col = NA) so you can see overlapping regions clearly.

Why we do this: Convex hulls help visualize three important ecological patterns:

  1. The total environmental space occupied by each biome - larger hulls indicate biomes that span wider environmental gradients
  2. How much different biomes overlap - overlapping hulls suggest similar environmental tolerances or transitional zones
  3. Which biomes are environmentally distinct - non-overlapping hulls indicate biomes occupying unique environmental niches

This visualization is powerful for understanding niche differentiation and ecological segregation. Biomes with completely separated hulls occupy fundamentally different environmental spaces, while those with extensive overlap may be differentiated by factors beyond climate (such as soil, disturbance history, or biogeographic barriers) or may represent arbitrary divisions along continuous gradients.

# Function to add convex hulls
add_hulls <- function(scores_x, scores_y, groups, colors) {
  for (i in seq_along(unique(groups))) {
    group <- unique(groups)[i]
    subset_idx <- which(groups == group)
    hull_idx <- chull(scores_x[subset_idx], scores_y[subset_idx])
    hull_points <- subset_idx[hull_idx]
    polygon(scores_x[hull_points], scores_y[hull_points],
            border = colors[i], lwd = 2, col = NA)
  }
}

# Biome labels aligned with the rows of env_vars
site_biomes <- full_data$BIOME.Name[as.numeric(row.names(env_vars))]

# Plot with convex hulls
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

# PC1 vs PC2 with hulls
plot(pc_scores[, 1], pc_scores[, 2],
     col = color_map[site_biomes],
     pch = 16, cex = 0.6,
     xlab = paste0("PC1 (", round(variance_explained[1], 1), "%)"),
     ylab = paste0("PC2 (", round(variance_explained[2], 1), "%)"),
     main = "PCA with Convex Hulls: PC1 vs PC2")
add_hulls(pc_scores[, 1], pc_scores[, 2], site_biomes,
          color_map[unique(site_biomes)])

# PC1 vs PC3 with hulls
plot(pc_scores[, 1], pc_scores[, 3],
     col = color_map[site_biomes],
     pch = 16, cex = 0.6,
     xlab = paste0("PC1 (", round(variance_explained[1], 1), "%)"),
     ylab = paste0("PC3 (", round(variance_explained[3], 1), "%)"),
     main = "PCA with Convex Hulls: PC1 vs PC3")
add_hulls(pc_scores[, 1], pc_scores[, 3], site_biomes,
          color_map[unique(site_biomes)])
# Add legend
plot.new()
legend("topright", legend = biome_types, col = colors, pch = 16, cex = 1)


3.8 Reflection Questions - Task 1

Question 1.1: Based on the scree plot and broken stick model, how many principal components would you retain for further analysis? Explain your reasoning and discuss what percentage of the total variance is explained by these components.

Question 1.2: Examine the variable loadings for the first two principal components. What environmental gradients do PC1 and PC2 appear to represent? Which bioclimatic variables contribute most strongly to each axis, and what does this tell you about the main environmental factors differentiating these biomes?


4 Task 2: Principal Coordinates Analysis (PCoA)

4.1 Background and Objectives

What is PCoA?

Principal Coordinates Analysis (also called metric multidimensional scaling) is an ordination technique that works with distance or dissimilarity matrices rather than raw data. Unlike PCA, which assumes linear relationships, PCoA can handle non-linear relationships through the choice of dissimilarity measure.

Why use Euclidean distance?

Euclidean distance is the straight-line distance between two points in multidimensional space. It’s appropriate for continuous environmental data and, when used with scaled variables, produces results similar to PCA but through a distance-based framework. This allows us to understand how distance-based ordination methods work while maintaining comparability with our PCA results.

What we will do:

  1. Scale environmental variables to ensure equal contribution
  2. Calculate Euclidean distance matrix
  3. Perform PCoA on the distance matrix
  4. Determine how many PCo axes to retain
  5. Visualize the ordination colored by biome
  6. Add convex hulls to delineate biome spaces

4.2 Step 2.1: Scale Environmental Data

What this step does: We standardize each environmental variable to have mean = 0 and standard deviation = 1 using the scale() function. This transformation ensures that all variables are on comparable scales before calculating distances.

Why we do this: Just like in PCA, scaling is crucial when working with variables measured in different units and ranges. Without scaling, variables with larger numeric ranges (like precipitation in mm) would dominate the distance calculations over variables with smaller ranges (like temperature in °C). Scaling ensures that each environmental variable contributes proportionally to the distance matrix based on its ecological variation, not its measurement units. This prevents arbitrary scale differences from distorting our understanding of ecological relationships.

# Scale environmental variables
env_scaled <- scale(env_vars)

# Check the transformation
cat("Original data - first 5 rows and columns:\n")
Original data - first 5 rows and columns:
print(head(env_vars[, 1:5]))
     bio.1     bio.2    bio.3     bio.4    bio.5
1 25.46549  9.390896 79.71220  54.08367 32.10250
2 24.50580  8.208070 64.66431 160.87517 30.78639
3 10.64886 10.517187 85.23533  42.03983 16.89175
4 25.52124  9.814855 83.98456  65.63322 31.80350
5 17.62540  7.354917 27.39363 728.02069 30.49400
6 21.27541  8.922480 84.69770  51.60979 26.56850
cat("\nScaled data - first 5 rows and columns:\n")

Scaled data - first 5 rows and columns:
print(head(env_scaled[, 1:5]))
       bio.1      bio.2      bio.3       bio.4      bio.5
1 0.89870849 -0.5415952  1.7612063 -1.31188242  0.5084503
2 0.84573063 -0.9250235  1.0277237 -1.09271373  0.4189001
3 0.08078172 -0.1764933  2.0304216 -1.33660005 -0.5265138
4 0.90178618 -0.4041635  1.9694550 -1.28817923  0.4881057
5 0.46590942 -1.2015842 -0.7889689  0.07124171  0.3990053
6 0.66740204 -0.6934384  2.0042158 -1.31695957  0.1319080
cat("\nMeans after scaling (should be ~0):\n")

Means after scaling (should be ~0):
print(round(colMeans(env_scaled), 10))
 bio.1  bio.2  bio.3  bio.4  bio.5  bio.6  bio.7  bio.8  bio.9 bio.10 bio.11 
     0      0      0      0      0      0      0      0      0      0      0 
bio.12 bio.13 bio.14 bio.15 bio.16 bio.17 bio.18 bio.19   Elev 
     0      0      0      0      0      0      0      0      0 
cat("\nStandard deviations after scaling (should be 1):\n")

Standard deviations after scaling (should be 1):
print(round(apply(env_scaled, 2, sd), 10))
 bio.1  bio.2  bio.3  bio.4  bio.5  bio.6  bio.7  bio.8  bio.9 bio.10 bio.11 
     1      1      1      1      1      1      1      1      1      1      1 
bio.12 bio.13 bio.14 bio.15 bio.16 bio.17 bio.18 bio.19   Elev 
     1      1      1      1      1      1      1      1      1 

4.3 Step 2.2: Calculate Euclidean Distance

What this step does: We calculate the Euclidean distance between every pair of sites using the scaled environmental data. The Euclidean distance between two sites is the square root of the sum of squared differences across all environmental variables. R’s dist() function efficiently computes this for all pairwise combinations.

Why we do this: The Euclidean distance matrix quantifies how environmentally similar or different each pair of sites is. Sites with small distances have similar environmental conditions across all climate variables; sites with large distances have very different conditions. This distance matrix becomes the input for PCoA, which will find a coordinate system that best preserves these distances. Unlike PCA which works directly with the data matrix, distance-based methods like PCoA offer flexibility in how we measure similarity - we could use other distance metrics for different data types or ecological questions.
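As a quick sanity check, the verbal definition above ("square root of the sum of squared differences") can be compared against dist() on a tiny toy matrix:

```r
# Toy matrix: 2 sites x 3 variables
m <- rbind(site_a = c(1, 2, 3),
           site_b = c(4, 6, 3))

# Manual Euclidean distance: sqrt of the summed squared differences
manual_d <- sqrt(sum((m["site_a", ] - m["site_b", ])^2))
manual_d  # sqrt(9 + 16 + 0) = 5

# dist() computes the same quantity for every pair of rows
as.numeric(dist(m, method = "euclidean"))  # 5
```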

# Calculate Euclidean distance matrix
cat("Calculating Euclidean distance matrix...\n")
Calculating Euclidean distance matrix...
euclidean_dist <- dist(env_scaled, method = "euclidean")
euclidean_dist[1:10]
 [1] 3.656405 5.666932 3.250639 4.969389 4.820121 3.941225 5.851185 2.139402
 [9] 3.361269 3.761787
ade4::is.euclid(euclidean_dist)
[1] TRUE
# Convert to matrix for inspection
dist_matrix <- as.matrix(euclidean_dist)

# Show example distances between first few sites
cat("\nExample: Distances between first 5 sites:\n")

Example: Distances between first 5 sites:
print(round(dist_matrix[1:5, 1:5], 2))
     1    2    3    4    5
1 0.00 3.66 5.67 3.25 4.97
2 3.66 0.00 4.17 2.55 4.26
3 5.67 4.17 0.00 5.21 6.11
4 3.25 2.55 5.21 0.00 5.20
5 4.97 4.26 6.11 5.20 0.00
# Display summary statistics
cat("\nDistance matrix summary:\n",
    "Dimensions:", nrow(dist_matrix), "x", ncol(dist_matrix), "\n",
    "Min distance:", round(min(euclidean_dist), 3), "\n",
    "Max distance:", round(max(euclidean_dist), 3), "\n",
    "Mean distance:", round(mean(euclidean_dist), 3), "\n",
    "Median distance:", round(median(euclidean_dist), 3), "\n")

Distance matrix summary:
 Dimensions: 1598 x 1598 
 Min distance: 0.035 
 Max distance: 20.283 
 Mean distance: 5.727 
 Median distance: 5.282 

4.4 Step 2.3: Perform PCoA

What this step does: Principal Coordinates Analysis (PCoA) takes our distance matrix and finds a coordinate system where the Euclidean distances between points best match the original environmental distances. We use the cmdscale() function (classical multidimensional scaling) to perform PCoA, requesting 20 dimensions and eigenvalues.

Why we do this: PCoA transforms the distance matrix into an ordination space where we can visualize relationships among sites. It finds axes (principal coordinates) that maximize the correspondence between distances in the ordination and the original environmental distances. Unlike PCA which decomposes variance in the data matrix, PCoA decomposes distances. However, when using Euclidean distance on scaled data, PCoA produces results mathematically equivalent to PCA - this demonstrates the connection between these two fundamental ordination methods. PCoA’s distance-based framework also makes it more flexible for non-Euclidean distances used with other data types.
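The equivalence claimed above can be checked on simulated data: classical MDS on Euclidean distances of scaled data reproduces the PCA scores up to reflection of the axes (sign flips):

```r
# Toy data only; the equivalence holds for any numeric data matrix
set.seed(7)
toy <- matrix(rnorm(50 * 4), ncol = 4)

pca_scores  <- prcomp(toy, scale. = TRUE)$x[, 1:2]
pcoa_scores <- cmdscale(dist(scale(toy)), k = 2)

# Identical coordinates once axis reflections are ignored
all.equal(abs(pca_scores), abs(pcoa_scores), check.attributes = FALSE)  # TRUE
```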

# Perform PCoA using cmdscale
cat("Performing PCoA...\n")
Performing PCoA...
pcoa_result <- cmdscale(euclidean_dist, k = 20, eig = TRUE)


# Extract positive eigenvalues
positive_eig <- pcoa_result$eig[pcoa_result$eig > 0]
variance_exp_pcoa <- (positive_eig / sum(positive_eig)) * 100
cumulative_var_pcoa <- cumsum(variance_exp_pcoa)

# Display variance table
variance_table_pcoa <- data.frame(
  PCo = paste0("PCo", 1:length(variance_exp_pcoa)),
  Variance = round(variance_exp_pcoa, 2),
  Cumulative = round(cumulative_var_pcoa, 2)
)

cat("\nVariance explained by Principal Coordinates:\n")

Variance explained by Principal Coordinates:
print(variance_table_pcoa[1:10, ])
     PCo Variance Cumulative
1   PCo1    51.13      51.13
2   PCo2    17.74      68.87
3   PCo3    11.09      79.96
4   PCo4     6.76      86.72
5   PCo5     5.14      91.87
6   PCo6     3.23      95.10
7   PCo7     1.80      96.90
8   PCo8     1.09      97.98
9   PCo9     0.89      98.88
10 PCo10     0.58      99.46
cat("\nNote: With Euclidean distance on scaled data, PCoA results are")

Note: With Euclidean distance on scaled data, PCoA results are
cat("\nmathematically equivalent to PCA. Compare these values to your PCA results!\n")

mathematically equivalent to PCA. Compare these values to your PCA results!

4.5 Step 2.4: Determine Number of Axes to Retain

What this step does: We use the same two methods as in PCA to determine how many PCoA axes to retain: the scree plot (elbow method) to visualize where variance levels off, and the broken stick model to identify axes that explain more variance than expected by chance.

Why we do this: Not all principal coordinates are equally informative. The first few axes capture the major environmental gradients differentiating sites, while later axes represent progressively smaller patterns or noise. The broken stick model provides an objective statistical threshold - only axes exceeding this null expectation represent meaningful patterns rather than random variation. This step ensures we focus on the most important axes for downstream analyses like clustering, avoiding the inclusion of noise that could obscure true patterns. The scree plot gives visual intuition, while the broken stick provides quantitative guidance.

par(mfrow=c(1,1))
# Scree plot
plot(1:20, variance_exp_pcoa[1:20], 
     type = "b", pch = 19, col = "darkgreen",
     xlab = "Principal Coordinate", 
     ylab = "Variance Explained (%)",
     main = "PCoA Scree Plot")
grid()

# Broken stick comparison
bs_values_pcoa <- broken_stick(20)

plot(1:20, variance_exp_pcoa[1:20], 
     type = "b", pch = 19, col = "darkgreen",
     xlab = "Principal Coordinate", 
     ylab = "Variance Explained (%)",
     main = "PCoA: Observed vs Broken Stick Model",
     ylim = c(0, max(variance_exp_pcoa[1:20])))
lines(1:20, bs_values_pcoa, type = "b", pch = 17, col = "red")
legend("topright", legend = c("Observed", "Broken Stick"),
       col = c("darkgreen", "red"), pch = c(19, 17), lty = 1)
grid()

n_axes_pcoa <- sum(variance_exp_pcoa[1:20] > bs_values_pcoa)
cat("\nNumber of PCo axes to retain (Broken Stick):", n_axes_pcoa, "\n")

Number of PCo axes to retain (Broken Stick): 3 
cat("Cumulative variance explained by retained axes:", 
    round(cumulative_var_pcoa[n_axes_pcoa], 2), "%\n")
Cumulative variance explained by retained axes: 79.96 %

4.6 Step 2.5: Visualize PCoA Ordination

What this step does: We visualize the PCoA ordination space by plotting sites using their coordinates on different pairs of PCo axes, with points colored by biome type. We create three different 2D views: PCo1 vs PCo2, PCo1 vs PCo3, and PCo2 vs PCo3.

Why we do this: Visualization reveals the spatial arrangement of sites and biomes in environmental space. Points positioned close together have similar environmental conditions (small Euclidean distances), while distant points differ substantially in their climate profiles. By coloring points by biome, we can assess whether sites from the same biome cluster together in environmental space - indicating distinct climatic niches - or intermix with other biomes - suggesting overlapping environmental tolerances or transitional zones. Different axis combinations reveal different aspects of environmental variation: PCo1 vs PCo2 typically shows the dominant gradients (temperature, precipitation), while other combinations can reveal secondary patterns. The variance percentages on axis labels help you judge how much ecological information each view contains.

# Extract PCo scores
pcoa_scores <- pcoa_result$points

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

# PCo1 vs PCo2
plot(pcoa_scores[, 1], pcoa_scores[, 2],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = paste0("PCo1 (", round(variance_exp_pcoa[1], 1), "%)"),
     ylab = paste0("PCo2 (", round(variance_exp_pcoa[2], 1), "%)"),
     main = "PCoA: PCo1 vs PCo2")


# PCo1 vs PCo3
plot(pcoa_scores[, 1], pcoa_scores[, 3],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = paste0("PCo1 (", round(variance_exp_pcoa[1], 1), "%)"),
     ylab = paste0("PCo3 (", round(variance_exp_pcoa[3], 1), "%)"),
     main = "PCoA: PCo1 vs PCo3")

# PCo2 vs PCo3
plot(pcoa_scores[, 2], pcoa_scores[, 3],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = paste0("PCo2 (", round(variance_exp_pcoa[2], 1), "%)"),
     ylab = paste0("PCo3 (", round(variance_exp_pcoa[3], 1), "%)"),
     main = "PCoA: PCo2 vs PCo3")
# Legend 
plot.new()
legend("topright", legend = biome_types, col = colors, pch = 16, cex = 1)

cat("\nNote: Because we used Euclidean distance on scaled data, these plots")

Note: Because we used Euclidean distance on scaled data, these plots
cat("\nshould look very similar to your PCA plots (possibly reflected or rotated).\n")

should look very similar to your PCA plots (possibly reflected or rotated).

4.7 Reflection Questions - Task 2

Question 2.1: Compare the PCoA ordination to the PCA ordination. Since we used Euclidean distance on scaled data, the results should be mathematically equivalent (though possibly reflected or rotated). Do you observe this equivalence? What does this tell you about the relationship between PCA and distance-based ordination methods?
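One way to check this equivalence numerically (a self-contained toy sketch; `toy` and the other names here are illustrative, not objects from the practical): for Euclidean distances on the same scaled matrix, each PCoA axis matches the corresponding PCA axis up to a possible sign flip, so the absolute axis-wise correlations are close to 1.

```r
set.seed(1)
toy <- scale(matrix(rnorm(100 * 5), ncol = 5))
pca_toy  <- prcomp(toy)$x               # PCA site scores
pcoa_toy <- cmdscale(dist(toy), k = 5)  # classical PCoA on Euclidean distances
# Correlate each PCA axis with the matching PCoA axis; |r| should be ~1
round(abs(diag(cor(pca_toy, pcoa_toy))), 6)
```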

Question 2.2: Looking at the convex hulls in the PCoA plots, which biomes show the most overlap in environmental space? Which are most distinct? What might this tell you about the environmental specificity of different vegetation types? Consider both the size of the hulls (environmental breadth) and their positions (environmental optima).


5 Task 3: Non-metric Multidimensional Scaling (NMDS)

5.1 Background and Objectives

What is NMDS?

Non-metric Multidimensional Scaling is an iterative ordination technique that attempts to represent dissimilarities between objects in a low-dimensional space. Unlike PCA and PCoA, NMDS uses rank orders rather than absolute distances, making it robust to non-linear relationships.

Understanding Stress:

Stress measures how well the ordination distances match the original dissimilarities. Lower stress indicates better fit:

  - Stress < 0.05: Excellent representation
  - Stress < 0.10: Good representation
  - Stress < 0.20: Acceptable representation
  - Stress > 0.20: Poor representation (use more dimensions)
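These rules of thumb can be wrapped in a small helper (illustrative only; the thresholds follow the guidelines above):

```r
# Map a stress value onto the standard interpretation categories
stress_quality <- function(s) {
  if (s < 0.05) "Excellent"
  else if (s < 0.10) "Good"
  else if (s < 0.20) "Acceptable"
  else "Poor - use more dimensions"
}
stress_quality(0.11)  # "Acceptable"
```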

What we will do:

  1. Load the vegan package for robust NMDS implementation
  2. Perform NMDS with different numbers of dimensions (2, 3, 4, 5)
  3. Plot how stress changes with dimensionality
  4. Select the optimal number of dimensions
  5. Visualize the NMDS ordination colored by biome
  6. Add convex hulls to show biome distributions

5.2 Step 3.1: Run NMDS with Different Dimensions

What this step does: We run NMDS multiple times using 2, 3, 4, and 5 dimensions. For each dimensionality, metaMDS() performs multiple random starts (trymax = 20) to find the best solution, iterating until stress stabilizes. The function returns the configuration (site positions) and stress value for each dimensionality.

Why we do this: Unlike PCA/PCoA where the number of meaningful axes is determined by the data’s variance structure, in NMDS you must specify the number of dimensions beforehand. Different dimensionalities trade off between simplicity and goodness-of-fit. Testing multiple dimensions helps us find the optimal balance: too few dimensions yield high stress (poor fit), while too many dimensions add complexity without meaningful improvement. Multiple random starts are crucial because NMDS optimization can get stuck in local minima - metaMDS() runs many attempts and keeps the best solution, ensuring we find a global (or near-global) optimum.

# Run NMDS with different numbers of dimensions
dimensions <- c(2, 3, 4, 5)
nmds_results <- list()
stress_values <- numeric(length(dimensions))

cat("Running NMDS with different numbers of dimensions...\n",
    "(Each run performs multiple random starts - this may take a while)\n\n")
Running NMDS with different numbers of dimensions...
 (Each run performs multiple random starts - this may take a while)
for (i in seq_along(dimensions)) {
  k <- dimensions[i]
  cat("Running NMDS with k =", k, "dimensions...\n")
  
  # Run metaMDS with specified dimensions
  # trace = FALSE suppresses iteration output
  # autotransform = FALSE because we already scaled our data
  # trymax = 20 performs 20 random starts
  nmds_results[[i]] <- metaMDS(euclidean_dist, k = k, 
                                trace = FALSE, 
                                autotransform = FALSE,
                                trymax = 20)
  
  stress_values[i] <- nmds_results[[i]]$stress
  cat("  Final stress =", round(stress_values[i], 4), "\n")
  
  # Interpret stress
  if (stress_values[i] < 0.05) {
    cat("  Interpretation: Excellent representation\n\n")
  } else if (stress_values[i] < 0.10) {
    cat("  Interpretation: Good representation\n\n")
  } else if (stress_values[i] < 0.20) {
    cat("  Interpretation: Acceptable representation\n\n")
  } else {
    cat("  Interpretation: Poor representation - consider more dimensions\n\n")
  }
}
Running NMDS with k = 2 dimensions...
  Final stress = 0.1088 
  Interpretation: Acceptable representation

Running NMDS with k = 3 dimensions...
  Final stress = 0.0664 
  Interpretation: Good representation

Running NMDS with k = 4 dimensions...
  Final stress = 0.0423 
  Interpretation: Excellent representation

Running NMDS with k = 5 dimensions...
  Final stress = 0.0246 
  Interpretation: Excellent representation

5.3 Step 3.2: Plot Stress vs Number of Dimensions

What this step does: We create a stress plot showing how goodness-of-fit changes with dimensionality. The plot includes reference lines at stress = 0.05, 0.10, and 0.20 to help interpret fit quality. We also identify the minimum number of dimensions needed to achieve acceptable stress (< 0.20).

Why we do this: The stress plot reveals the trade-off between model complexity and fit quality. It helps us answer: “How many dimensions do we need?” Key considerations:

  - Diminishing returns: After a certain point, additional dimensions improve stress only marginally
  - Interpretability: 2-3 dimensions are easy to visualize; higher dimensions are harder to interpret
  - Overfitting: Too many dimensions can fit noise rather than signal

The standard thresholds (0.05, 0.10, 0.20) come from decades of ecological practice. Stress < 0.10 is generally considered good for reliable interpretation, while stress > 0.20 suggests the ordination may be misleading. The optimal choice balances statistical fit with practical interpretability - often this means accepting slightly higher stress to gain easier visualization in 2-3 dimensions.

# Create stress plot
plot(dimensions, stress_values,
     type = "b", pch = 19, col = "purple", lwd = 2,
     xlab = "Number of Dimensions",
     ylab = "Stress",
     main = "NMDS: Stress vs Dimensionality",
     ylim = c(0, max(stress_values) * 1.1))

# Add reference lines
abline(h = 0.20, lty = 2, col = "red", lwd = 2)
abline(h = 0.10, lty = 2, col = "orange", lwd = 2)
abline(h = 0.05, lty = 2, col = "green", lwd = 2)

# Add labels
text(max(dimensions), 0.20, "Acceptable (0.20)", pos = 2, col = "red", cex = 0.9)
text(max(dimensions), 0.10, "Good (0.10)", pos = 2, col = "orange", cex = 0.9)
text(max(dimensions), 0.05, "Excellent (0.05)", pos = 2, col = "green", cex = 0.9)
grid()

# Determine optimal dimensions
if (any(stress_values < 0.20)) {
  optimal_k <- dimensions[which(stress_values < 0.20)[1]]
  cat("\nOptimal number of dimensions (stress < 0.20):", optimal_k, "\n")
} else {
  cat("\nWarning: No dimensionality achieved stress < 0.20\n")
  cat("Consider testing higher dimensions or using a different distance metric.\n")
  optimal_k <- dimensions[which.min(stress_values)]
  cat("Using k =", optimal_k, "with lowest stress =", round(min(stress_values), 4), "\n")
}

Optimal number of dimensions (stress < 0.20): 2 
cat("\nStress values summary:\n")

Stress values summary:
for (i in seq_along(dimensions)) {
  cat("  k =", dimensions[i], ": stress =", round(stress_values[i], 4), "\n")
}
  k = 2 : stress = 0.1088 
  k = 3 : stress = 0.0664 
  k = 4 : stress = 0.0423 
  k = 5 : stress = 0.0246 

5.4 Step 3.3: Visualize NMDS Ordination

What this step does: We visualize the 5-dimensional NMDS solution (the lowest-stress result from Step 3.1). We plot its first three axes in three different 2D combinations: NMDS1 vs NMDS2, NMDS1 vs NMDS3, and NMDS2 vs NMDS3, with points colored by biome type.

Why we do this: Visualization reveals whether NMDS successfully arranges sites so that environmentally similar locations are close together and dissimilar ones are far apart. By coloring points by biome, we assess whether NMDS ordination space reflects ecological classification - strong clustering by biome suggests the climate variables that drive NMDS axes also drive vegetation patterns.

Important differences from PCA/PCoA:

  1. No variance explained: NMDS axes don’t capture proportions of variance - all axes are equally important
  2. Arbitrary rotation: The orientation and flip of NMDS axes is arbitrary; only relative positions matter
  3. Rank-based: NMDS preserves the rank order of dissimilarities, not absolute distances, making it robust to non-linear relationships

The stress value shown in the plot title indicates trustworthiness - lower stress means the 2D projection accurately represents the multidimensional relationships. Multiple views (different axis pairs) show different aspects of the ordination space.
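Before relying on any 2D view, it can help to inspect the Shepard diagram for the chosen solution (a sketch; assumes the vegan package is loaded and `nmds_results` and `dimensions` exist from Step 3.1): it plots ordination distances against the original dissimilarities, and a tight scatter around the monotone fit line corresponds to low stress.

```r
# Shepard diagram for the 5-dimensional NMDS solution from Step 3.1
stressplot(nmds_results[[which(dimensions == 5)]])
```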

# Select the 5-dimensional NMDS result (lowest stress) for visualization
# (a lower-dimensional solution with acceptable stress, e.g. optimal_k, would also work)
nmds_k5 <- nmds_results[[which(dimensions == 5)]]
nmds_scores <- nmds_k5$points

# Create visualizations
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

# NMDS1 vs NMDS2
plot(nmds_scores[, 1], nmds_scores[, 2],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = "NMDS1",
     ylab = "NMDS2",
     main = paste0("NMDS: Axis 1 vs 2 (Stress = ", 
                   round(nmds_k5$stress, 3), ")"))

# NMDS1 vs NMDS3
plot(nmds_scores[, 1], nmds_scores[, 3],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = "NMDS1",
     ylab = "NMDS3",
     main = "NMDS: Axis 1 vs 3")

# NMDS2 vs NMDS3
plot(nmds_scores[, 2], nmds_scores[, 3],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = "NMDS2",
     ylab = "NMDS3",
     main = "NMDS: Axis 2 vs 3")
#legend
plot.new()
legend("topright", legend = biome_types, col = colors, pch = 16, cex = 1)

cat("\nNote: NMDS axes don't have inherent meaning or variance values.\n")

Note: NMDS axes don't have inherent meaning or variance values.
cat("All axes are equally valid - they represent different dimensions of the dissimilarity space.\n")
All axes are equally valid - they represent different dimensions of the dissimilarity space.
cat("The orientation (rotation/reflection) of the plot is arbitrary.\n")
The orientation (rotation/reflection) of the plot is arbitrary.

5.5 Step 3.4: Add Convex Hulls to NMDS

What this step does: We add convex hulls (polygons enclosing all points) around each biome in the NMDS ordination space. The hulls are drawn without fill to clearly show overlapping regions between different biomes.

Why we do this: Convex hulls reveal important ecological patterns in ordination space:

  1. Biome distinctiveness: Non-overlapping hulls indicate biomes with unique environmental signatures that NMDS successfully separates; overlapping hulls suggest similar environmental conditions or transitional zones

  2. Environmental breadth: Large hulls show biomes spanning wide environmental gradients (generalists); small hulls indicate narrow environmental tolerances (specialists)

  3. Niche relationships: The spatial arrangement of hulls shows which biomes occupy similar vs. different environmental niches

Comparing across methods: By now you’ve created hull plots for all three ordination techniques (PCA, PCoA, NMDS). Comparing them reveals:

  - Do all methods separate biomes similarly? This suggests robust, method-independent patterns
  - Do some biomes separate better in one method? PCA captures linear gradients, PCoA preserves distances, NMDS preserves rank orders - different biome-environment relationships favor different methods
  - Which method shows clearest separation overall? This suggests which mathematical framework best matches your data structure

Differences between methods illuminate whether biome-climate relationships are primarily linear (PCA works best), distance-based (PCoA works best), or rank-ordered (NMDS works best).

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

# NMDS1 vs NMDS2 with hulls
plot(nmds_scores[, 1], nmds_scores[, 2],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = "NMDS1",
     ylab = "NMDS2",
     main = "NMDS with Convex Hulls: Axis 1 vs 2")
add_hulls(nmds_scores[, 1], nmds_scores[, 2],
          full_data$BIOME.Name[as.numeric(row.names(env_vars))],
          colors)


# NMDS1 vs NMDS3 with hulls
plot(nmds_scores[, 1], nmds_scores[, 3],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = "NMDS1",
     ylab = "NMDS3",
     main = "NMDS with Convex Hulls: Axis 1 vs 3")
add_hulls(nmds_scores[, 1], nmds_scores[, 3],
          full_data$BIOME.Name[as.numeric(row.names(env_vars))],
          colors)

# legend
plot.new()
legend("topright", legend = biome_types, col = colors, pch = 16, cex = 1)
cat("\nCompare these convex hulls to your PCA and PCoA hulls from Tasks 1 and 2.\n")

Compare these convex hulls to your PCA and PCoA hulls from Tasks 1 and 2.
cat("Similarities suggest robust patterns; differences reveal which method best captures biome-environment relationships.\n")
Similarities suggest robust patterns; differences reveal which method best captures biome-environment relationships.


5.6 Reflection Questions - Task 3

Question 3.1: Based on the stress values you calculated, which number of dimensions provides the best balance between model complexity and goodness-of-fit? Would you consider the representation with your chosen number of dimensions to be excellent, good, or acceptable according to standard stress guidelines? Explain your reasoning.

Question 3.2: Compare the three ordination methods (PCA, PCoA, and NMDS) in terms of how clearly they separate the different biomes. Which method appears most effective for distinguishing vegetation types? Consider both the visual separation in ordination space and the conceptual differences between methods (linear vs. distance-based vs. rank-based). Why might different ordination techniques produce different patterns, and what does this tell you about the nature of biome-climate relationships?


6 Task 4: Non-hierarchical Classification (K-means Clustering)

6.1 Background and Objectives

What is K-means clustering?

K-means is a partitioning method that groups objects into a pre-specified number of clusters (k) by minimizing within-cluster variation. The algorithm iteratively assigns objects to the nearest cluster centroid and recalculates centroids until convergence.

Why use ordination spaces for clustering?

Ordination reduces noise and collinearity in the data, making patterns more distinct. Clustering in ordination space often produces more interpretable and stable groups than clustering in the original high-dimensional space.
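The assign/update loop at the heart of K-means can be sketched in a few lines of base R (toy data; every name here is illustrative): assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points, and repeat. kmeans() does the same with better initialisation, multiple starts, and a convergence check.

```r
set.seed(42)
# Two well-separated toy clusters of 10 points each (means 0 and 6)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 6), ncol = 2))
centroids <- pts[c(1, 20), ]  # one starting centroid near each true group
for (iter in 1:10) {
  # Assignment step: index of the nearest centroid (squared Euclidean distance)
  d <- sapply(1:2, function(k) rowSums(sweep(pts, 2, centroids[k, ])^2))
  groups <- max.col(-d, ties.method = "first")
  # Update step: each centroid becomes the mean of its cluster
  centroids <- rbind(colMeans(pts[groups == 1, , drop = FALSE]),
                     colMeans(pts[groups == 2, , drop = FALSE]))
}
table(groups)  # two clusters of 10 points each
```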

What we will do:

  1. Apply K-means clustering to PCA, PCoA, and NMDS ordination spaces
  2. Compare classifications across different ordination methods
  3. Determine the optimal number of clusters for PCoA using multiple criteria
  4. Test whether clusters are statistically different using PERMANOVA
  5. Assess classification accuracy relative to known biome types

6.2 Step 4.1: K-means on All Three Ordination Spaces

What this step does: This step applies K-means clustering to each of the three ordination spaces (PCA, PCoA, and NMDS) we created in previous tasks. We set the number of clusters equal to the number of known biomes in our dataset to see how well unsupervised clustering can recover the known ecological classifications.

Why we do this: By comparing clustering across different ordination methods, we can assess which ordination technique best captures the natural groupings in the climate-vegetation data. Each ordination method has different properties (PCA assumes linear relationships, PCoA preserves distances, NMDS preserves rank orders), so they may produce different clustering results.

Key outputs:

  - Total within-cluster SS: Measures how compact the clusters are (lower is better)
  - Between-cluster SS: Measures how separated the clusters are (higher is better)
  - Ratio (between/total): The proportion of variance explained by the clustering (higher values indicate better-defined clusters)

# Number of known biomes
n_biomes <- length(unique(full_data$BIOME.Name))
cat("Number of biomes in dataset:", n_biomes, "\n")
Number of biomes in dataset: 16 
# K-means on PCA space (using retained axes)
set.seed(42)
kmeans_pca <- kmeans(pc_scores[,1:n_axes_retain], centers = n_biomes, nstart = 25)

# K-means on PCoA space
set.seed(42)
kmeans_pcoa <- kmeans(pcoa_scores[, 1:n_axes_pcoa], centers = n_biomes, nstart = 25)

# K-means on NMDS space (using 5 dimensions)
set.seed(42)
kmeans_nmds <- kmeans(nmds_scores, centers = n_biomes, nstart = 25)

# Summary of results
cat("\n--- K-means Results ---\n")

--- K-means Results ---
cat("PCA - Total within-cluster SS:", round(kmeans_pca$tot.withinss, 2), "\n")
PCA - Total within-cluster SS: 1955.82 
cat("PCA - Between-cluster SS:", round(kmeans_pca$betweenss, 2), "\n")
PCA - Between-cluster SS: 23584.11 
cat("PCA - Ratio (between/total):", round(kmeans_pca$betweenss / kmeans_pca$totss, 3), "\n\n")
PCA - Ratio (between/total): 0.923 
cat("PCoA - Total within-cluster SS:", round(kmeans_pcoa$tot.withinss, 2), "\n")
PCoA - Total within-cluster SS: 1955.82 
cat("PCoA - Between-cluster SS:", round(kmeans_pcoa$betweenss, 2), "\n")
PCoA - Between-cluster SS: 23584.11 
cat("PCoA - Ratio (between/total):", round(kmeans_pcoa$betweenss / kmeans_pcoa$totss, 3), "\n\n")
PCoA - Ratio (between/total): 0.923 
cat("NMDS - Total within-cluster SS:", round(kmeans_nmds$tot.withinss, 2), "\n")
NMDS - Total within-cluster SS: 3515.71 
cat("NMDS - Between-cluster SS:", round(kmeans_nmds$betweenss, 2), "\n")
NMDS - Between-cluster SS: 21675.95 
cat("NMDS - Ratio (between/total):", round(kmeans_nmds$betweenss / kmeans_nmds$totss, 3), "\n")
NMDS - Ratio (between/total): 0.86 

6.3 Step 4.2: Visualize K-means Classifications

What this step does: Creates side-by-side visualizations showing the original biome classifications (left panels) and the K-means cluster assignments (right panels) for each ordination method. Black asterisks mark the cluster centroids (the center points of each cluster).

Why we do this: Visual comparison allows us to quickly assess how well the unsupervised K-means clustering recovers the known biome patterns. Good clustering should show distinct, separated groups that align reasonably well with the original biome classifications.

What to look for:

  - Do the K-means clusters form distinct groups or overlap extensively?
  - Are the cluster centroids well-separated from each other?
  - Do the clustering patterns look similar across the three ordination methods?
  - Where do K-means clusters align with biome boundaries, and where do they differ?

par(mfrow = c(3, 2), mar = c(4, 4, 3, 1))

# PCA - original biomes
plot(pc_scores[, 1], pc_scores[, 2],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.8,
     xlab = "PC1", ylab = "PC2",
     main = "PCA: Original Biomes")

# PCA - K-means clusters
plot(pc_scores[, 1], pc_scores[, 2],
     col = kmeans_pca$cluster,
     pch = 16, cex = 0.8,
     xlab = "PC1", ylab = "PC2",
     main = "PCA: K-means Clusters")
points(kmeans_pca$centers[, 1], kmeans_pca$centers[, 2],
       pch = 8, cex = 2, lwd = 2, col = "black")

# PCoA - original biomes
plot(pcoa_scores[, 1], pcoa_scores[, 2],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.8,
     xlab = "PCo1", ylab = "PCo2",
     main = "PCoA: Original Biomes")

# PCoA - K-means clusters
plot(pcoa_scores[, 1], pcoa_scores[, 2],
     col = kmeans_pcoa$cluster,
     pch = 16, cex = 0.8,
     xlab = "PCo1", ylab = "PCo2",
     main = "PCoA: K-means Clusters")
points(kmeans_pcoa$centers[, 1], kmeans_pcoa$centers[, 2],
       pch = 8, cex = 2, lwd = 2, col = "black")

# NMDS - original biomes
plot(nmds_scores[, 1], nmds_scores[, 2],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.8,
     xlab = "NMDS1", ylab = "NMDS2",
     main = "NMDS: Original Biomes")

# NMDS - K-means clusters
plot(nmds_scores[, 1], nmds_scores[, 2],
     col = kmeans_nmds$cluster,
     pch = 16, cex = 0.8,
     xlab = "NMDS1", ylab = "NMDS2",
     main = "NMDS: K-means Clusters")
points(kmeans_nmds$centers[, 1], kmeans_nmds$centers[, 2],
       pch = 8, cex = 2, lwd = 2, col = "black")


6.4 Step 4.3: Determine Optimal Number of Clusters (PCoA space)

What this step does: Tests different numbers of clusters (k = 2 to 15) using two methods to find the “natural” number of groups in the data:

  1. Elbow method: Plots how within-cluster variation decreases as k increases; the “elbow” (the point where improvement slows) suggests the optimal k
  2. Silhouette method: Calculates how well each point fits its assigned cluster compared to other clusters; values range from -1 (poor fit) to +1 (excellent fit)

Why we do this: The true number of ecological groups may differ from the number of named biomes. These methods help us discover the number of clusters that best fits the underlying data structure without being constrained by pre-existing classification schemes.

What to look for:

  - In the elbow plot: Where does the curve flatten out?
  - In the silhouette plot: Which k value gives the highest average silhouette score?
  - Do these methods agree on an optimal k?

# Test different numbers of clusters
k_values <- 2:15
wss <- numeric(length(k_values))
silhouette_scores <- numeric(length(k_values))

set.seed(42)
for (i in seq_along(k_values)) {
  k <- k_values[i]
  km_temp <- kmeans(pcoa_scores[, 1:n_axes_pcoa], centers = k, nstart = 25)
  wss[i] <- km_temp$tot.withinss
  
  # Calculate silhouette score
  cluster_assignments <- km_temp$cluster
  sil_scores <- numeric(nrow(pcoa_scores))
  
  for (j in 1:nrow(pcoa_scores)) {
    # Average distance to points in j's own cluster
    # (note: this simplified version includes point j itself, whose distance is 0,
    # so a_i is slightly smaller than under the textbook silhouette definition)
    own_cluster <- cluster_assignments[j]
    own_cluster_points <- pcoa_scores[cluster_assignments == own_cluster, 1:n_axes_pcoa]
    a_i <- mean(sqrt(rowSums((sweep(own_cluster_points, 2, pcoa_scores[j, 1:n_axes_pcoa]))^2)))
    
    # Calculate average distance to nearest cluster
    other_clusters <- unique(cluster_assignments[cluster_assignments != own_cluster])
    b_values <- numeric(length(other_clusters))
    for (c_idx in seq_along(other_clusters)) {
      other_cluster_points <- pcoa_scores[cluster_assignments == other_clusters[c_idx], 1:n_axes_pcoa]
      b_values[c_idx] <- mean(sqrt(rowSums((sweep(other_cluster_points, 2, pcoa_scores[j, 1:n_axes_pcoa]))^2)))
    }
    b_i <- min(b_values)
    
    sil_scores[j] <- (b_i - a_i) / max(a_i, b_i)
  }
  silhouette_scores[i] <- mean(sil_scores)
}

# Plot results
par(mfrow = c(1, 2), mar = c(4, 4, 3, 2))

# Elbow plot
plot(k_values, wss,
     type = "b", pch = 19, col = "blue",
     xlab = "Number of Clusters (k)",
     ylab = "Total Within-Cluster Sum of Squares",
     main = "Elbow Method")
grid()

# Silhouette plot
plot(k_values, silhouette_scores,
     type = "b", pch = 19, col = "darkgreen",
     xlab = "Number of Clusters (k)",
     ylab = "Average Silhouette Score",
     main = "Silhouette Method")
abline(h = 0, lty = 2, col = "red")
grid()

# Identify optimal k
optimal_k_silhouette <- k_values[which.max(silhouette_scores)]
cat("\nOptimal number of clusters (Silhouette):", optimal_k_silhouette, "\n")

Optimal number of clusters (Silhouette): 4 
cat("Maximum Silhouette score:", round(max(silhouette_scores), 3), "\n")
Maximum Silhouette score: 0.413 
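The hand-rolled silhouette loop above can be cross-checked against the cluster package for any given solution (a sketch; assumes the 'cluster' package is installed and that kmeans_pcoa, pcoa_scores, and n_axes_pcoa exist from the steps above):

```r
library(cluster)
# Silhouette widths for the k = 16 solution from Step 4.1
sil <- silhouette(kmeans_pcoa$cluster, dist(pcoa_scores[, 1:n_axes_pcoa]))
mean(sil[, "sil_width"])  # average silhouette width
```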

6.5 Step 4.4: Run K-means with Optimal Number of Clusters

What this step does: Performs K-means clustering using the optimal number of clusters identified in Step 4.3. Creates visualizations comparing this data-driven clustering approach with the original biome classifications, and generates a confusion matrix showing how cluster assignments correspond to actual biomes.

Why we do this: This allows us to compare a classification based purely on the data’s natural structure versus the traditional expert-defined biome system. The confusion matrix reveals which biomes are most distinct and which tend to blend together based on climate variables.

What to look for:

  - How does the optimal k compare to the number of original biomes?
  - In the confusion matrix: Are certain biomes consistently assigned to single clusters, or are they split across multiple clusters?
  - Do some clusters contain sites from multiple different biomes?

# Run K-means with optimal k
set.seed(42)
kmeans_optimal <- kmeans(pcoa_scores[, 1:n_axes_pcoa], 
                         centers = optimal_k_silhouette, 
                         nstart = 25)

# Visualize optimal clustering
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))

# Original biomes
plot(pcoa_scores[, 1], pcoa_scores[, 2],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.8,
     xlab = "PCo1", ylab = "PCo2",
     main = paste("PCoA: Original", n_biomes, "Biomes"))


# Optimal clusters
plot(pcoa_scores[, 1], pcoa_scores[, 2],
     col = kmeans_optimal$cluster,
     pch = 16, cex = 0.8,
     xlab = "PCo1", ylab = "PCo2",
     main = paste0("PCoA: Optimal K-means (k = ", optimal_k_silhouette, ")"))
points(kmeans_optimal$centers[, 1], kmeans_optimal$centers[, 2],
       pch = 8, cex = 2, lwd = 2, col = "black")
# legend 
plot.new()
legend("topright", legend = biome_types, col = colors, pch = 16, cex = 1)

# Create confusion matrix
cat("\n--- Confusion Matrix: Optimal Clusters vs Original Biomes ---\n")

--- Confusion Matrix: Optimal Clusters vs Original Biomes ---
confusion_matrix <- table(Cluster = kmeans_optimal$cluster, 
                          Biome = full_data$BIOME.Name[as.numeric(row.names(env_vars))])
print(confusion_matrix)
       Biome
Cluster Boreal Forests/Taiga Deserts & Xeric Shrublands
      1                    0                          0
      2                  100                         33
      3                    0                         67
      4                    0                          0
       Biome
Cluster Flooded Grasslands & Savannas Mangroves
      1                             0         0
      2                            20         0
      3                            79        41
      4                             1        59
       Biome
Cluster Mediterranean Forests, Woodlands & Scrub
      1                                        0
      2                                       22
      3                                       78
      4                                        0
       Biome
Cluster Montane Grasslands & Shrublands Rock and Ice
      1                               1           98
      2                              64            2
      3                              33            0
      4                               2            0
       Biome
Cluster Temperate Broadleaf & Mixed Forests Temperate Conifer Forests
      1                                   0                         0
      2                                  87                        70
      3                                  11                        22
      4                                   2                         8
       Biome
Cluster Temperate Grasslands, Savannas & Shrublands
      1                                           0
      2                                          86
      3                                          14
      4                                           0
       Biome
Cluster Tropical & Subtropical Coniferous Forests
      1                                         0
      2                                         0
      3                                        88
      4                                        12
       Biome
Cluster Tropical & Subtropical Dry Broadleaf Forests
      1                                            0
      2                                            0
      3                                           92
      4                                            8
       Biome
Cluster Tropical & Subtropical Grasslands, Savannas & Shrublands
      1                                                        0
      2                                                        0
      3                                                       98
      4                                                        2
       Biome
Cluster Tropical & Subtropical Moist Broadleaf Forests Tundra Water
      1                                              0     36     0
      2                                              0     62    80
      3                                             27      0    20
      4                                             73      0     0


6.6 Step 4.5: Assess Classification Accuracy

What this step does: Calculates how accurately each clustering method recovers the original biome classifications. For each cluster, the code identifies which biome is most common in that cluster and assigns all members to that biome. Accuracy is the percentage of sites correctly classified.

Why we do this: This quantifies the agreement between unsupervised clustering (based purely on climate patterns) and expert-defined biomes. High accuracy suggests biomes are climatically distinct; lower accuracy might indicate that biome boundaries are influenced by factors beyond climate, or that biomes exist along continuous gradients rather than as discrete categories.

What to look for:

  - Which ordination method produces the most accurate clustering?
  - How does accuracy change with the optimal k versus the original number of biomes?
  - Which biomes are most consistently assigned to single clusters?
  - Which biomes are frequently confused with each other?

# Calculate classification accuracy for each method
calculate_accuracy <- function(clusters, true_biomes) {
  # Find best matching between clusters and biomes
  confusion <- table(clusters, true_biomes)
  
  # For each cluster, assign it to the most common biome
  cluster_to_biome <- apply(confusion, 1, function(x) colnames(confusion)[which.max(x)])
  
  # Calculate accuracy
  predicted_biomes <- cluster_to_biome[as.character(clusters)]
  accuracy <- sum(predicted_biomes == true_biomes) / length(true_biomes)
  
  return(list(accuracy = accuracy, 
              cluster_assignment = cluster_to_biome,
              confusion = confusion))
}

# Calculate accuracies
acc_pca <- calculate_accuracy(kmeans_pca$cluster, full_data$BIOME.Name[as.numeric(row.names(env_vars))])
acc_pcoa <- calculate_accuracy(kmeans_pcoa$cluster, full_data$BIOME.Name[as.numeric(row.names(env_vars))])
acc_nmds <- calculate_accuracy(kmeans_nmds$cluster, full_data$BIOME.Name[as.numeric(row.names(env_vars))])
acc_optimal <- calculate_accuracy(kmeans_optimal$cluster, full_data$BIOME.Name[as.numeric(row.names(env_vars))])

cat("\n--- Classification Accuracy ---\n")
cat("PCA K-means (", n_biomes, " clusters): ", round(acc_pca$accuracy * 100, 2), "%\n", sep = "")
cat("PCoA K-means (", n_biomes, " clusters): ", round(acc_pcoa$accuracy * 100, 2), "%\n", sep = "")
cat("NMDS K-means (", n_biomes, " clusters): ", round(acc_nmds$accuracy * 100, 2), "%\n", sep = "")
cat("PCoA K-means (", optimal_k_silhouette, " clusters): ", round(acc_optimal$accuracy * 100, 2), "%\n", sep = "")

--- Classification Accuracy ---
PCA K-means (16 clusters): 41.61%
PCoA K-means (16 clusters): 41.61%
NMDS K-means (16 clusters): 43.49%
PCoA K-means (4 clusters): 23.09%
# Detailed confusion matrix for PCoA with n_biomes clusters
cat("\n--- Detailed Confusion Matrix: PCoA K-means ---\n")
print(acc_pcoa$confusion)

--- Detailed Confusion Matrix: PCoA K-means ---
        true_biomes
clusters Boreal Forests/Taiga Deserts & Xeric Shrublands
      1                     0                          2
      2                     0                         65
      3                     0                          0
      4                     0                          0
      5                     0                          4
      6                    35                          1
      7                     0                          0
      8                     0                          0
      9                    20                          0
      10                    0                          0
      11                    0                          1
      12                    0                          7
      13                    8                          7
      14                    0                          0
      15                    0                          0
      16                   37                         13
        true_biomes
clusters Flooded Grasslands & Savannas Mangroves
      1                             38        13
      2                             14         0
      3                              0        27
      4                              0        21
      5                              2         0
      6                              0         0
      7                              4         3
      8                              5        32
      9                              0         0
      10                             0         0
      11                             0         0
      12                            17         4
      13                             2         0
      14                             0         0
      15                             0         0
      16                            18         0
        true_biomes
clusters Mediterranean Forests, Woodlands & Scrub
      1                                         3
      2                                        26
      3                                         0
      4                                         0
      5                                        60
      6                                         0
      7                                         3
      8                                         0
      9                                         5
      10                                        0
      11                                        1
      12                                        1
      13                                        0
      14                                        0
      15                                        1
      16                                        0
        true_biomes
clusters Montane Grasslands & Shrublands Rock and Ice
      1                                5            0
      2                                7            0
      3                                0            0
      4                                1            0
      5                                2            0
      6                                0            1
      7                                1            0
      8                                1            0
      9                                0            0
      10                               0           21
      11                              12            0
      12                               4            0
      13                              17            0
      14                               0           78
      15                              48            0
      16                               2            0
        true_biomes
clusters Temperate Broadleaf & Mixed Forests Temperate Conifer Forests
      1                                    0                         1
      2                                    1                         3
      3                                    0                         1
      4                                    1                         0
      5                                   13                        16
      6                                    0                         0
      7                                    8                        17
      8                                    2                         0
      9                                   51                        16
      10                                   0                         0
      11                                   0                         5
      12                                   0                         0
      13                                  10                        29
      14                                   0                         0
      15                                   0                         3
      16                                  14                         9
        true_biomes
clusters Temperate Grasslands, Savannas & Shrublands
      1                                            0
      2                                            4
      3                                            0
      4                                            0
      5                                           26
      6                                            0
      7                                            2
      8                                            0
      9                                           11
      10                                           0
      11                                           0
      12                                           0
      13                                          25
      14                                           0
      15                                           0
      16                                          32
        true_biomes
clusters Tropical & Subtropical Coniferous Forests
      1                                         29
      2                                          5
      3                                          4
      4                                          1
      5                                          2
      6                                          0
      7                                          1
      8                                         14
      9                                          0
      10                                         0
      11                                        27
      12                                        17
      13                                         0
      14                                         0
      15                                         0
      16                                         0
        true_biomes
clusters Tropical & Subtropical Dry Broadleaf Forests
      1                                            42
      2                                             0
      3                                             1
      4                                             0
      5                                             0
      6                                             0
      7                                             0
      8                                            15
      9                                             0
      10                                            0
      11                                            3
      12                                           39
      13                                            0
      14                                            0
      15                                            0
      16                                            0
        true_biomes
clusters Tropical & Subtropical Grasslands, Savannas & Shrublands
      1                                                        30
      2                                                         2
      3                                                         1
      4                                                         0
      5                                                         4
      6                                                         0
      7                                                         1
      8                                                         9
      9                                                         0
      10                                                        0
      11                                                        3
      12                                                       50
      13                                                        0
      14                                                        0
      15                                                        0
      16                                                        0
        true_biomes
clusters Tropical & Subtropical Moist Broadleaf Forests Tundra Water
      1                                              12      0     8
      2                                               0      0     0
      3                                              26      0     0
      4                                              19      0     0
      5                                               0      0    19
      6                                               0     38    10
      7                                               9      0     2
      8                                              28      0     1
      9                                               0      7    24
      10                                              0     43     0
      11                                              4      0     3
      12                                              2      0     3
      13                                              0      2     3
      14                                              0      6     0
      15                                              0      0     0
      16                                              0      2    27
cat("\nCluster to Biome Assignment:\n")
for (i in seq_along(acc_pcoa$cluster_assignment)) {
  cat("Cluster", i, "->", acc_pcoa$cluster_assignment[i], "\n")
}

Cluster to Biome Assignment:
Cluster 1 -> Tropical & Subtropical Dry Broadleaf Forests 
Cluster 2 -> Deserts & Xeric Shrublands 
Cluster 3 -> Mangroves 
Cluster 4 -> Mangroves 
Cluster 5 -> Mediterranean Forests, Woodlands & Scrub 
Cluster 6 -> Tundra 
Cluster 7 -> Temperate Conifer Forests 
Cluster 8 -> Mangroves 
Cluster 9 -> Temperate Broadleaf & Mixed Forests 
Cluster 10 -> Tundra 
Cluster 11 -> Tropical & Subtropical Coniferous Forests 
Cluster 12 -> Tropical & Subtropical Grasslands, Savannas & Shrublands 
Cluster 13 -> Temperate Conifer Forests 
Cluster 14 -> Rock and Ice 
Cluster 15 -> Montane Grasslands & Shrublands 
Cluster 16 -> Boreal Forests/Taiga 

6.7 Reflection Questions - Task 4

Question 4.1: Compare the classification accuracies across the three ordination methods (PCA, PCoA, NMDS). Which ordination space produced the most accurate K-means clustering? Why might certain ordination methods be more suitable for classification tasks than others?

Question 4.2: You determined an optimal number of clusters using the silhouette method, which may differ from the actual number of biomes in the dataset. What does this suggest about the natural structure of vegetation types based on climate data? Discuss whether biomes are truly discrete categories or whether they represent points along continuous environmental gradients. How might this impact conservation and vegetation mapping efforts?


7 Summary and Conclusions

7.1 Key Findings

Today you have learned to:

  1. Apply ordination techniques - PCA, PCoA, and NMDS each reveal different aspects of how environmental variables structure vegetation distributions
  2. Determine dimensionality - Using scree plots, broken stick models, and stress values to select appropriate numbers of axes
  3. Classify ecological data - K-means clustering can identify natural groupings, with performance varying by ordination method
  4. Validate classifications - PERMANOVA and accuracy metrics help assess whether groups are statistically and ecologically meaningful

7.2 Take-Home Messages

  • No single best method: Different ordination techniques have different strengths. PCA assumes linear relationships among variables, PCoA accommodates any dissimilarity measure, and NMDS makes the fewest assumptions, capturing non-linear, rank-order patterns.
  • Dimensionality matters: Retaining too few axes loses information; retaining too many adds noise.
  • Classification is challenging: Even with clear environmental gradients, perfectly classifying vegetation types is difficult because ecosystems exist along continua rather than in discrete boxes.
  • Statistical validation is essential: Always test whether your groups are significantly different and evaluate classification accuracy.

7.3 Further Exploration

Consider these extensions:

  • Use the classified model to predict vegetation types at unsampled locations using WorldClimateData.csv
  • Apply hierarchical clustering methods and compare to K-means
  • Investigate which bioclimatic variables best discriminate deciduous broadleaf forests
  • Explore random forests or other machine learning approaches for vegetation classification
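As a starting point for the first extension, one simple approach (an illustrative sketch, not the method used in this practical) is to assign each new site to its nearest K-means centroid after projecting the site's climate values into the fitted ordination space, e.g. with predict() on the PCA object. The sketch below uses synthetic data in place of real ordination scores; nearest_centroid() is a hypothetical helper:

```r
# Sketch: classify sites by their nearest K-means centroid.
# In practice, 'scores' would be WorldClimateData.csv sites projected
# into the same ordination space used to fit the clustering.
nearest_centroid <- function(scores, centers) {
  # Squared Euclidean distance from every site (rows of 'scores')
  # to every centroid (rows of 'centers'), via the expansion
  # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
  d2 <- outer(rowSums(scores^2), rowSums(centers^2), "+") -
    2 * scores %*% t(centers)
  # Index of the closest centroid for each site
  apply(d2, 1, which.min)
}

set.seed(42)
train <- matrix(rnorm(100), ncol = 2)    # stand-in for ordination scores
km <- kmeans(train, centers = 3, nstart = 10)

# Sanity check: each centroid is nearest to itself
nearest_centroid(km$centers, km$centers)  # 1 2 3
```

Once each new site has a cluster, the cluster-to-biome assignment from calculate_accuracy() can translate clusters into predicted vegetation types, including deciduous broadleaf forest.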